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Abstract: Bayesian alternatives to frequentist propensity score approaches have recently been pro¬ 
posed. However, few studies have investigated their covariate balancing properties. This article com¬ 
pares a recently developed two-step Bayesian propensity score approach to the frequentist approach 
with respect to covariate balance. The effects of different priors on covariate balance are evaluated 
and the differences between frequentist and Bayesian covariate balance are discussed. Results of the 
case study reveal that both the Bayesian and frequentist propensity score approaches achieve good 
covariate balance. The frequentist propensity score approach performs slightly better on covariate 
balance for stratification and weighting methods, whereas the two-step Bayesian approach offers 
slightly better covariate balance in the optimal full matching method. Results of a comprehensive 
simulation study reveal that accuracy and precision of prior information on propensity score model 
parameters do not greatly influence balance performance. Results of the simulation study also show 
that overall, the optimal full matching method provides the best covariate balance and treatment effect 
estimates compared to the stratification and weighting methods. A unique feature of covariate balance 
within Bayesian propensity score analysis is that we can obtain a distribution of balance indices in 
addition to the point estimates so that the variation in balance indices can be naturally captured to 
assist in covariate balance checking. 
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INTRODUCTION 

It is well-established that in randomized experiments, individuals are assigned to treatment 
conditions with a known probability. By contrast, in observational studies, individuals self¬ 
select into treatment conditions on the basis of an unknown mechanism. Thus, the selection 
process is often highly nonrandom, introducing selection bias that may result in highly 
unbalanced covariates (i.e., greatly different covariate distributions in the treatment group 
and control group) and thus severely weakening causal inferences. The best that can be 
hoped for is that the investigator has obtained measurable and reliable covariates that relate 
to the selection mechanism and the potential outcomes, recognizing that unobservable 
covariates might still be operating to bias treatment effect estimates. Establishing balance 
between the treatment and control groups on observable covariates is thus essential for 
obtaining unbiased treatment effect estimates. 
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In a classic article, Rosenbaum and Rubin (1983) proposed propensity score analysis 
as a practical tool for reducing selection bias through balancing on measured covariates, 
where the propensity score is a scalar function of covariates so that subjects who match on 
their propensity scores can be treated as having similar covariate background. A variety of 
propensity score techniques have been developed for both the estimation and the application 
of the propensity score. Models for estimating the propensity score equation have included, 
for example, parametric logit regression with chosen interaction and polynomial terms 
(e.g., Dehejia & Wahba, 1999; Hirano & Imbens, 2001), and nonparametric generalized 
boosting modeling (McCaffrey, Ridgeway, & Morral, 2004). Methods for estimating the 
treatment effect while accounting for the propensity score include stratification, weighting, 
matching, and regression adjustment. Each of these propensity score techniques have a 
common goal, which is to achieve balanced covariate distributions across the different 
treatment conditions. Covariate balance after propensity score adjustment, therefore, is 
often evaluated to assess the effectiveness of propensity score techniques (see, e.g., Harder, 
Stuart & Anthony, 2010). 

There have been numerous research studies examining the adequacy of covariate 
balance. For example, Rubin (2001) suggested three criteria for gauging balance between 
the treatment and control group, namely, small mean differences on propensity scores 
between the treatment and control group, similar variances of the propensity scores between 
treatment and control conditions, and similar residual variances of the covariates after the 
propensity score adjustment (the variance ratio should be close to one) in the two groups. 
Later, Austin and Mamdani (2006) demonstrated various indices to evaluate covariate 
balance, such as standardized differences and nonparametric density estimates. Among 
approaches for assessing covariate balance, statistical hypothesis testing has often been 
utilized in practice. However, there are two major complaints shared by Imai, King, and 
Stuart (2008) and Austin (2008) with regard to the validity of balance testing: (a) covariate 
balance is a sample property, and thus hypothesis testing is not relevant, and (b) sample- 
size reduction in matched samples can lead to less significant imbalance results even if 
the absolute imbalances remain constant. Agreement surrounding these concerns, though, 
has not been fully reached (Hansen, 2008; Hill, 2008; Stuart, 2008). In this context, a 
Bayesian perspective is more appealing because it can naturally avoid both problems and 
is the subject of this article. 

Propensity score analysis has been mainly developed within the frequentist framework 
of statistics. An alternative Bayesian perspective for propensity score analysis was originally 
advocated by Rubin (1985). McCandless, Gustafson, and Austin (2009) first provided a 
practical Bayesian propensity score approach that stratifies individuals on the quintiles of 
the estimated propensity score, treating the propensity score as a latent variable and jointly 
estimating the propensity score and outcome models. McCandless, Gustafson, Austin, and 
Levy (2009) examined the covariate balance of their joint modeling approach via a case 
study of the effectiveness of beta-blocker therapy in heart failure patients. Their Bayesian 
stratification approach with simultaneous estimation of the propensity score and treatment 
effect was shown to provide poorer balance than the conventional frequentist counterpart, 
but it did reduce the association between the covariates and outcome and thus reduced the 
confounding effects of unbalanced covariates. 

A concern surrounding the work of McCandless, Gustafson, and Austin (2009) cen¬ 
ters on the joint modeling of the propensity score and outcome equations. Specifically, 
the propensity score should only be determined by covariates measured prior to treatment 
implementation and not influenced by the outcome measured after treatment assignment 
Steiner and Cook (2013). To address this problem, Kaplan and Chen (2012) developed a 
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two-step Bayesian propensity score approach, fitting the propensity score model and out¬ 
come model separately and examining its performance in regard to treatment effect and 
variance estimates via propensity score stratification, weighting, and optimal full match¬ 
ing methods. However, the balance properties of the two-step Bayesian propensity score 
approach have not yet been evaluated. 

This article investigates the balance properties of the two-step Bayesian propensity 
approach (Kaplan & Chen, 2012) via a case study and a simulation study and demonstrates 
a way of assessing covariate balance within the Bayesian framework. The link between 
covariate balance and treatment estimation is also examined and discussed. The uniqueness 
of evaluating covariate balance in Bayesian propensity score approaches is that the uncer¬ 
tainty of balance indices can be naturally accounted for through the posterior distribution 
of propensity score model parameters. The posterior probability intervals (i.e., credible 
intervals) of balance indices can be acquired to judge the adequacy of covariate balance 
beyond simply point estimates. Concern (a), just mentioned with regard to balance tests, is 
irrelevant here because a Bayesian perspective draws inferences on the model parameters 
conditional on the sample and does not involve a hypothetical population from which the 
sample is derived. Concern (b) with regard to the impact of sample-size reduction accom¬ 
panying matching on balance tests does not apply either because for Bayesian propensity 
score matching, uncertainty in the balance indices is captured by posterior distributions 
based on all the subjects, unlike frequentist hypothesis testing that examines significance 
of balance indices in matched samples with reduced sample sizes. 

For the purpose of completeness, this article first briefly introduces the framework 
of propensity score analysis and its implementation. Then the following section presents 
Bayesian propensity score analysis and provides a comprehensive review of the Kaplan and 
Chen’s (2012) two-step Bayesian propensity score approach. Next, the study design as well 
as results of a case study are presented and discussed, followed by the design and results 
of a simulation study. Finally, the conclusion section summarizes the article and discusses 
the pros and cons of covariate balance within Bayesian propensity score approaches. 


PROPENSITY SCORE APPROACH AND ITS IMPLEMENTATION 

Throughout this article, we consider the simple case of a two-group problem, where indi¬ 
viduals self-select into a treatment group or a control group. The propensity score is defined 
to be the conditional probability that a participant selects into the treatment group given a 
vector of covariates measured prior to treatment selection. Following the general notation 
of the Rubin causal model (Rubin, 1974), let Yu be the potential outcome for individual 
i if the individual receives the treatment and To/ be the potential outcome for individual i 
if the individual does not receive the treatment. The treatment indicator is denoted as T, 
where for individual i, T l = 1 if the individual was exposed to the treatment, and 7} = 0 
if the individual was not exposed to the treatment. The propensity score of individual i 
given a vector of observed covariates x can be expressed as e, (x ,) — P(T) = I v, )• For the 
treatment effect estimation based on the population of individuals, there are two causal 
estimands that may be of interest. One is the average treatment effect (ATE), defined as 


YATE = E(Yu) - E(Yoi), ( 1 ) 

which is the mean difference between a unit assigned to the treatment condition and a unit 
from the same population but assigned to the control condition. The other causal estimand 
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is the average treatment effect on the treated (ATT), defined as 


y ATT = E ( Y\j|Tj = 1) - E(Y 0i \Ti = 1), (2) 

which assumes that an individual is assigned to the treatment group and the question 
concerns the outcome of that individual had he or she been assigned to the control group. 
In this article, we focus on estimates of the ATE. 

All propensity score techniques rely on the assumption of strong ignorability (Rosen¬ 
baum & Rubin, 1983), which states that the potential outcomes are assumed to be inde¬ 
pendent of treatment assignment given observed covariates. If strong ignorability does not 
hold, then hidden biases due to unobserved covariates may exist. Under strong ignorability, 
participants with the same propensity score are expected to have the same distribution on the 
set of covariates, here referred to as “covariate balance.” Assuming strong ignorability and 
that covariate balance obtains, estimates of treatment effects from nonrandomized studies 
approximate those that would have been obtained if a randomized study was conducted. In 
practice, however, the true propensity score is unknown, and instead propensity scores are 
estimated by fitting a model such as a logistic regression model of the treatment assign¬ 
ment on the measured covariates. Then, based on the estimated propensity scores, different 
techniques, including stratification (Rosenbaum & Rubin, 1983, 1984), inverse-propensity 
weighting (Rosenbaum, 1987; Wooldridge, 2002), matching (Hansen, 2004; Rosenbaum, 
1989), and regression estimation (Rubin, 1979), are utilized in the outcome model to obtain 
the estimate of the causal effect of interest. 

As noted, there are several common methods for implementing a propensity score 
adjustment. The propensity score stratification method (Rosenbaum & Rubin, 1983) incor¬ 
porates the idea of Cochran (1968) that five subclasses can remove as high as 90% of the 
bias due to the subclassifying covariate. The stratification method first sorts all the partic¬ 
ipants by their estimated propensity scores e(x ) and then stratifies them into five or more 
strata at the quintiles or certain quantiles of the propensity score distribution. Estimates 
of the treatment effect within each stratum are calculated and the overall causal effect 
is obtained by averaging over five or more within-strata treatment effects. This method 
has seen increased popularity in educational, psychological, and medical research (e.g., 
Ayanian, Landrum, Guadagnoli, & Gaccione, 2002; Leow, Marcus, Zanutto, & Boruch, 
2004; Swanson et al., 2007). 

The inverse propensity score weighting approach is based on the Horvitz-Thompson 
estimator Horvitz and Thompson (1952) from the survey sampling literature. The inverse- 
propensity weighting approach weights participants in treatment group and control group 
by 1 /e(x) and 1/(1— e(x)), respectively, to balance these groups on covariates. 1 The details 
of this approach can be found in Hirano and Imbens (2001); Hirano, Imbens, and Ridder 
(2003); and Lunceford and Davidian (2004). 

Another technique, known as propensity score matching, matches participants in the 
treatment group to the control group on the estimated propensity score. Various matching 
methods have been explored in the context of propensity score analysis such as nearest- 
neighbor matching, caliper matching, Mahalanobis metric matching, and optimal matching 
(see, e.g.. An, 2010; Hansen, 2004). Matching types can be one-to-one, one-to-many, many- 
many matching, and so on. The optimal matching method is “optimal” in the sense that each 

1 This particular weight yields the average treatment effect. One could weight by T + (1 — 7) , 

where T = 1 if the individual is assigned to the treatment group T = 0, if not. This weight provides 
an estimate of the average treatment effect on the treated. 
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match can be revisited to minimize the total distance between all the matches rather than 
removing the matched subjects from further consideration in the greedy matching method 
(Rosenbaum, 1989). Examples of the matching approach can be found in Foster (2003); 
Dehejia and Wahba (2002); and Guo, Barth, and Gibbons (2006). 

Finally, the regression adjustment approach or the covariate-adjustment approach di¬ 
rectly includes the estimated propensity score linearly (and possibly nonlinearly) in the 
outcome equation (Kang & Schafer, 2007; Schafer & Kang, 2008). The regression adjust¬ 
ment approach is easy to implement but relies more on the correct specification of the 
outcome model compared to other propensity score methods. Applications of this approach 
can be found in Kurth et al. (2006) and Shadish et al. (2008). 


BAYESIAN PROPENSITY SCORE APPROACH 

Over the past two decades, significant advances have been made in the area of Bayesian 
statistical inference, owing mostly to computational developments and readily available 
software (e.g., Gilks, Richardson, & Spiegelhalter, 1996). These computational advances 
have led to increased applications of Bayesian methods to problems in the social and 
behavioral sciences. As is well known, the Bayesian perspective begins by specifying a 
model for an outcome of interest, elicits prior distributions for all model parameters, and 
obtains the joint posterior distribution of the model parameters given the data via Bayes’ 
theorem and some chosen computing algorithm, such as the Markov chain Monte Carlo 
(MCMC) methods including the Metropolis-Hastings algorithm, the Gibbs sampler, and so 
on. For a general review specific to the social and behavioral sciences, see Kaplan (2014). 

Rubin (1985) argued that a Bayesian approach to propensity score analysis should be of 
great interest to the applied Bayesian analyst, and yet propensity score estimation within the 
Bayesian framework was not addressed until relatively recently. Hoshino (2008) developed 
a quasi-Bayesian estimation method for general parametric models, such as latent variable 
models, and developed an MCMC algorithm to estimate the propensity score, McCandless, 
Gustafson, and Austin (2009) first provided a practical Bayesian approach to propensity 
score stratification, estimating the propensity score and the treatment effect and sampling 
from the joint posterior distribution of model parameters via an MCMC algorithm. The 
marginal posterior probability of the treatment effect can then be obtained based on the 
joint posterior distribution. Similar to the McCandless, Gustafson, and Austin (2009) study, 
An (2010) presented a Bayesian approach that jointly models both the propensity score 
equation and outcome equation at the same time and extended this one-step Bayesian 
approach to propensity score regression and single nearest neighbor matching methods. 

A consequence of the Bayesian joint modeling procedure utilized by McCand¬ 
less, Gustafson, and Austin (2009) and An (2010) is that the propensity score esti¬ 
mates may be affected by the outcome variable that are observed after treatment as¬ 
signment, resulting in biased propensity score estimation. In fact, Zigler et al. (2013) 
assessed the joint estimation of the Bayesian propensity score and the treatment ef¬ 
fect and found that the feedback between propensity score model and outcome model 
can lead to poor treatment effect estimates. This model feedback is especially prob¬ 
lematic if the relationship between the outcome and the propensity score is mis- 
specified (McCandless, Douglas, Evans, & Smeeth, 2010). To solve this problem, 
McCandless, Douglas, Evans, and Smeeth (2010) utilized an approximate Bayesian tech¬ 
nique introduced by Lunn et al. (2009) for preventing undesirable feedback between propen¬ 
sity score model and outcome model components. Specifically, McCandless, Douglas, 
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Evans, and Smeeth (2010) included the posterior distribution of the propensity score pa¬ 
rameters as covariate input in the outcome model so that the flow of information between 
the propensity score and the outcome is restricted. This so-called sequential Bayesian 
propensity score analysis yields treatment effect estimates that are comparable to estimates 
obtained from frequentist propensity score analysis. Nevertheless, as McCandless, Dou¬ 
glas, Evans, and Smeeth (2010) pointed out, their method is only approximately Bayesian 
and also encounters the difficulty that the Markov chain is not guaranteed to converge. 


The Two-Step Bayesian Propensity Score Approach 

To maintain a fully Bayesian specification while overcoming the conceptual and practical 
difficulties of the joint modeling methods of McCandless, Gustafson, and Austin (2009) 
and An (2010), a two-step Bayesian propensity score approach was recently developed by 
Kaplan and Chen (2012) that can incorporate prior information on the model parameters of 
both the propensity score equation and outcome model equation. Consistent with Bayesian 
theory (see, e.g., De Finetti, 1974), specifying prior distributions on the model parameters 
is a natural way to quantify uncertainty—here in both the propensity score and outcome 
equations. 

In the Kaplan and Chen (2012) two-step Bayesian propensity score approach (BPSA), 
the propensity score model is fit in the first step, specified as the following logit model. 

Los {t?b> 0 = " + ^' (3) 

where a is the intercept, ft refers to the slope, and x represents a design matrix of chosen 
covariates. For BPSA, the R package MCMClogit (Martin, Quinn, & Park, 2010) was 
utilized to sample from the posterior distribution of a and ft using a random walk Metropolis 
algorithm. Other R packages such as arm (Gelman et al., 2011) can also be applied to 
draw posterior samples from the posterior distribution. After the posterior samples of the 
propensity score are obtained, a Bayesian linear model is fit in the second step as the outcome 
model to obtain the posterior draws of the treatment effect via various propensity score 
methods such as stratification, weighting, optimal matching, and regression adjustment. 
We refer to the Bayesian linear model in the second step as the Bayesian outcome model. 
An illustration of the two-step Bayesian propensity score approach is shown in Figure 1. 

For illustration purposes, consider a posterior sampling procedure of a chosen Bayesian 
logit model with 1, 000 iterations and a thinning interval of 1. Then for each observation, 
there are m = 1, 000 posterior propensity scores e(x) calculated from the posterior samples 
of propensity score model parameters a and ft as follows: 

Kx)= " P( ° + W . (4) 

1 + exp(a + ft'x) 

Based on each posterior propensity score, there are J = 1,000 posterior draws of 
the treatment effect generated from the posterior distribution of y (i — 1 ,..., m, j — 
1,...,/), where y is the treatment effect. For notational simplicity, let;; denote the vector 
of propensity score model parameters. Kaplan and Chen (2012) then provide the following 
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1. Posterior distribution of parameters 2. Posterior distribution of parameter in the Bayesian 
in the Bayesian propensity score model outcome model based on propensity score parameters 

Figure 1. Illustration of the two-step Bayesian propensity score approach. 

treatment effect estimator: 


m J 

E(y | x, y, T ) = m~ l J~ l EE YjOli). 


1=1 7=1 


( 5 ) 


where J~ l Yjitfi) i s the posterior sample mean of treatment effect y in the Bayesian 
outcome model based on the /th set of propensity score model parameters i], and then this 
posterior sample mean is averaged over m sets of posterior propensity scores. The posterior 
variance of y is based on the total variance formula as follows: 

m m I >n 1 

Var(y \ x, y, T) = m~ l ^ rf (m) + (.m - l) -1 ^ l F Y ( m ) - m~ l ^ Fyim) | . (6) 

i= 1 i =1 [ i =1 J 

where 

j 

Yjim)- J ~' EK;^) 

7=1 


’vim) 


= (y - l )- 1 E 


7=1 



is the posterior sample variance of y in the Bayesian outcome model under the i th set of 
propensity scores and 


j 

llyfri,) = J ~' Ek,^). (8) 

7=1 

is the posterior sample mean of y in the same Bayesian outcome model. Note that Equation 
6 captures two sources of variation. The first source of variation is the average of the 
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posterior variances of treatment effect y across the posterior samples of propensity scores, 
represented by the first part of the right hand side of Equation 6, and the second source of 
variation comes from the variance of the posterior means of treatment effect y across the 
posterior samples of propensity scores, obtained by the second part of the right of hand side 
of Equation 6. 

Utilizing the aforementioned estimators, Kaplan and Chen (2012) conducted three 
comprehensive simulation studies as well as a small case study comparing frequentist 
propensity score analysis with the two-step Bayesian alternative focusing on the treatment 
effect and variance estimates. The effects of different sample sizes, true treatment effects and 
choice of priors on treatment effect and variance estimates were also evaluated. Consistent 
with Bayesian theory, Kaplan and Chen’s findings showed that when no prior information 
is available, specifying a larger prior variance of treatment effect is desirable to obtain 
estimates similar to frequentist results but with more accurate intervals; when accurate 
prior information with regard to the treatment effect is attainable, specifying a smaller prior 
variance is preferable in order to obtain more precise treatment effect estimates. For the 
case of small sample size, the Bayesian approach shows slight superiority in the estimation 
of the treatment effect compared to the frequentist counterpart. 

Kaplan and Chen (2012) studied the treatment effect and variance estimates of the 
two-step Bayesian propensity score approach, but the covariate balance of their approach 
has not been examined. Therefore, this article investigates covariate balance of the newly 
developed two-step full Bayesian propensity score approach under stratification, weighting, 
and optimal full matching methods via a case study and a real-data-based simulation 
study. We demonstrate a practical way of assessing the balance properties of the Bayesian 
propensity score approach via the point estimates as well as the posterior distribution of 
balance indices and also link the covariate balance results with the accuracy of treatment 
effect estimates. In addition, we discuss the unique features of the Bayesian approach in 
terms of covariate balance. 


DESIGN OF THE CASE STUDY 

A case study is conducted to examine properties of the two-step Bayesian propensity 
score approach. The data for this case study come from the Early Childhood Longitudinal 
Study Kindergarten cohort (ECLS-K) of 1998 (NCES, 2001). The sampled children at¬ 
tended either full-day or part-day kindergarten programs and have diverse socioeconomic 
and racial/ethnic backgrounds. Also, a number of variables assessing early childhood 
(pre-K) experiences were collected in the ECLS-K; thus propensity score approaches can 
be fruitfully applied. This article investigates the treatment effect of full-day versus part- 
day kindergarten attendance on children’s reading achievement at the end of 1998 fall 
kindergarten. 

A sample of 1,000 children were randomly selected proportional to the number of 
children in full-day or part-day kindergarten in the population. This resulted in 538 children 
in full-day programs and 462 children in part-day programs. Fourteen covariates were 
chosen for estimating the propensity scores, including gender, race, mother’s employment 
status, child’s age at kindergarten entry, child’s age at first nonparental care, primary 
type of nonparental care, language spoken to child by parent, number of siblings, family 
composition, mother’s employment between child’s birth and kindergarten, number of 
nonparental care arrangements pre-K, social economic status, parent’s expectation of child’s 
degree, and how often parent reads to child. All analyses utilized various packages within 
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the R software environment (R Development Core Team 2011; described next), and missing 
data were handled via multiple imputations using the R program mice package (Van Buuren 
& Groothuis-Oudshoorn, 2011). Default settings in the mice program were used. 

The procedure for assessing covariate balance is as follows. First, propensity scores 
are estimated via a Bayesian logit model. We use the “bayesglm" function of the arm 
package (Gelman et ah, 2011) in R that adopts an approximate EM algorithm to update the 
parameters at each step to obtain the posterior samples of parameters in the Bayesian logit 
model. To handle problematic covariates, the “bayesglm” package uses a default Student’s 
t prior for all regression coefficients. The posterior sampling runs 1,000 iterations, and thus 
we obtain 1,000 sets of propensity score estimates in the posterior sample. We then apply 
the estimated propensity scores to balance on covariates via propensity score stratification, 
weighting and optimal full matching methods. After having the matched strata/subgroups 
based on estimated propensity scores for stratification and optimal full matching, stratum 
frequency based weights are utilized to calculate weighted averages of covariates across 
matched strata/subgroups for the treatment group and the control group separately. Then 
the balance indices are obtained through taking the difference or the ratio of the weighted 
averages between the treatment group and the control group. In terms of the weighting 
method, inverse propensity score weights are used in the same way as stratum weights 
above to obtain estimated balance indices. For the optimal full matching method, the 
“optmatch” package in R (Hansen & Klopfer, 2006) is adopted. 

For both the case study and the simulation study described next, the balance of contin¬ 
uous covariates and categorical covariates is evaluated separately. The balance indices used 
in this study are the standardized mean/proportion difference (Cohen’s d\ Cohen, 1988) and 
variance ratio for each continuous covariate/each level of categorical covariates between 
treatment group and control group. Specifically, the standardized mean difference for a 
continuous covariate is obtained by 

B x = (x, - x c )/y/{s} + sfj/2, (9) 

where x, and x c are the sample mean of each covariate in treatment and control groups, 
respectively, and s} and s} : are corresponding sample variances. The variance ratio for a 
continuous covariate, R \, is defined as s} /,v ( 2 . All the categorical covariates are dummy 
coded. Then for each categorical level, we evaluate the standardized difference in propor¬ 
tions between different treatment conditions, consistent with Harder, Stuart and Anthony 
(2010). The standardized proportion difference is calculated by 

B 2 = (Pt ~ Pc)/y/\Pt(l ~ Pt ) + Pci 1 - Pc)V 2 > (10) 

where p t and p c are proportions of participants in the treatment group and control group, 
respectively, for a specific level of categorical covariates. The variance ratio for a certain 
categorical level, Ro, is calculated by p,( 1 — p,)/p c (l — p c ). 

In addition to the aforementioned point estimates, the Bayesian propensity score 
approach provides the 95% posterior probability intervals (PPI) of the standardized 
mean/proportion difference and the variance ratio based on the posterior distribution of 
the propensity score parameters. For each set of estimated propensity scores in the poste¬ 
rior sample, we can obtain a point estimate of each balance index. Because we have 1,000 
sets of estimated propensity scores, a distribution of the balance index with 1,000 points can 
be obtained, and we extract the mean of this posterior distribution as the point estimate of 
the covariate balance as well as the 2.5 and 97.5 percentiles to form the corresponding 95% 
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Table 1. Average absolute standardized mean/proportion difference (Cohen’s d), average variance 
ratio between treatment and control group across all covariates and treatment effect and standard 
error estimates (Trt. (SE)) in the case study 


Method 

Cohen's d 

Variance Ratio 

Trt. (SE) 

Unadjusted 

Bayesian 

0.11 

1.26 

1.74 (0.77) 

Stratification 

Frequentist 

0.02 

1.07 

2.00 (0.76) 


Bayesian 
(Noninformative) 

0.03 

1.08 

2.05 (0.82) 

Weighting 

Frequentist 

0.01 

1.04 

2.26 (0.76) 


Bayesian 

(Noninformative) 

0.02 

1.12 

2.67 (1.17) 

Optimal Matching 

Frequentist 

0.09 

1.32 

2.23 (0.75) 


Bayesian 
(Noninformative) 

0.02 

1.07 

2.28 (0.90) 


PPI. In contrast, for one data set, the frequentist propensity score approach provides only 
one point estimate of each balance index. To compare with the interval estimate provided 
by the Bayesian approach, we bootstrap 1,000 data sets from the original data to obtain a 
95% bootstrap confidence interval for each balance index. 


RESULTS OF THE CASE STUDY 

Results of the case study are presented in Table 1 as well as in Figures 2 to 8. Table 1 
shows the average absolute standardized mean/proportion difference and variance ratio 
across all the covariates for different propensity score methods as well as the treatment 
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Figure 2. Initial covariate balance check in the case study. 
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Stratification—Conventional 
(+: continuous; o: binary) 


Stratification—Bayesian Noninfonmative 
(+: continuous; o: binary) 



Figure 3. Standardized mean/proportion difference and corresponding 95% bootstrap/posterior 
probability interval for stratification in the case study. 


effect estimates. In the Figures 2 to 8, the balance of each individual covariate is illustrated, 
where each continuous covariate is marked by a plus and each categorical level is marked 
by a circle. Figure 2 illustrates the initial imbalance in the sampled ECLS-K data set, and 
Figure 3 to Figure 8 present the interval estimates of the balance indices for propensity 
score stratification, weighting and optimal full matching methods. 

Overall, for the point estimates of Cohen’s d and variance ratio, the two-step Bayesian 
propensity score approach performs similarly compared to the frequentist propensity score 
methods across most of the continuous covariates and categorical levels. In terms of the 
average standardized mean/proportion difference and variance ratio, the frequentist propen¬ 
sity score methods provide slightly better covariate balance in stratification and weighting, 
whereas the two-step Bayesian method achieves better covariate balance under optimal 
full matching method. On average, both frequentist and Bayesian propensity score adjust¬ 
ment methods greatly reduce initial imbalance as shown in Table 1. After propensity score 
adjustments, the standardized mean/proportion differences are within the ideal range (± 
0.1 standard deviation) for most of the continuous/categorical covariates, and some covari¬ 
ates even achieve approximately zero standardized mean/proportion difference between 
the treatment group and control group as illustrated in Figure 3 to Figure 5. According to 
the balance criteria presented in Rubin (2001), variance ratios for most of the covariates 
in the treatment group and control group fall in the acceptable range (between 1/2 and 2) 
except for a few levels of categorical covariates. Many of the variance ratios are in the 
desirable range (between 4/5 and 5/4). 

In addition to point estimates, Figure 3 to Figure 5 also show the 95% bootstrap con¬ 
fidence intervals (for conventional frequentist approach) or posterior probability intervals 
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Weighting-Conventional 
(+: continuous; o: binary) 


Weighting-Bayesian Noninformative 
(+: continuous; o: binary) 




Std Mean Difference Std. Mean Difference 

Figure 4. Standardized mean/proportion difference and corresponding 95% bootstrap/posterior 
probability interval for weighting in the case study. 


(for Bayesian approach) of the standardized mean/proportion differences for the different 
propensity score methods. A 95% interval that falls in the desirable range indicates good 
covariate balance. For the stratification method, the PPIs provided by the two-step Bayesian 
propensity score approach are similar to the bootstrap intervals estimated by the frequentist 
propensity score approach. For the weighting method, a greater number of frequentist boot¬ 
strap intervals fall into the desirable range compared to the the two-step Bayesian approach, 
where as for the optimal full matching method, the PPIs produced by the two-step Bayesian 
approach are consistently better than the bootstrap intervals, among which a 95% bootstrap 
interval falls out of the desirable range completely. 

Moreover, Figure 6 to Figure 8 display the 95% bootstrap confidence intervals or PPIs 
for the variance ratio of each covariate/categorical level between two treatment conditions. 
Similar to the point estimates, the frequentist propensity score approach provides slightly 
better bootstrap intervals for the stratification and weighting methods, where as the two- 
step Bayesian approach offers better posterior probability intervals for the variance ratio 
under the optimal full matching method. We note that due to some extreme bootstrap data 
sets, the upper bound of the variance ratio for one level of the family composition variable 
is approximately infinite under frequentist propensity score stratification, weighting and 
optimal full matching methods. Thus, no interval bar is drawn for that categorical level 
in Figures 6 to 8. The Bayesian propensity score approach shows unique benefits in this 
context because PPIs can be obtained simultaneously with point estimates based on the 
posterior distribution of the balance indices, where as bootstrap confidence intervals may 
sometimes be undesirable due to some extreme bootstrap samples. 
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Optimal Matching-Conventional Optimal Matching-Bayesian Noninformative 

(+: continuous; o: binary) (+: continuous; o: binary) 
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Figure 5. Standardized mean/proportion difference and corresponding 95% bootstrap/posterior 
probability interval for optimal full matching in the case study. 
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Figure 6. Variance ratio and corresponding 95% bootstrap/posterior probability interval for stratifi¬ 
cation in the case study. 
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Weighting-Conventional 
(+: continuous; o: binary) 
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Figure 7. Variance ratio and corresponding 95 % bootstrap/poterior probability interval for weighting 
in the case study. 
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Figure 8. Variance ratio and corresponding 95% bootstrap/posterior probability interval for optimal 
full matching in the case study. 
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In terms of the consistency of performance on covariate balance and treatment effect 
estimates, results presented in Table 1 indicate that the two-step Bayesian propensity score 
approach is similar to the frequentist approach with regard to both covariate balance and 
treatment effect estimation for the stratification method. For the weighting method, though 
Bayesian and frequentist approaches provide comparable covariate balance, their treatment 
effect estimates differ considerably. In contrast, for the optimal full matching method, 
Bayesian and frequentist propensity score approaches perform similarly on the treatment 
effect estimation, though their covariate balance results are distinct, implying that treatment 
effect estimation by the optimal full matching method is relatively robust to the remaining 
covariate imbalance. As the true treatment effect is unknown in the case study, we further 
discuss the link between covariate balance and treatment effect estimates in the following 
simulation study. 


DESIGN OF THE SIMULATION STUDY 

An important feature of Bayesian techniques is the capability of incorporating prior infor¬ 
mation from expert opinion or previous research to directly address uncertainty in model 
parameters. Kaplan and Chen (2012) have shown that accurate prior information regard¬ 
ing treatment effect yields higher bias reduction and improved treatment effect estimates. 
Therefore, to further explore the effects of prior information on the balance properties of 
the two-step Bayesian propensity score approach and to further examine the link between 
covariate balance and treatment effect estimation, we conduct a simulation study in which 
200 samples, each with size of 1,000, are bootstrapped from the observed data in the case 
study to mimic the real data. The same 14 covariates are adopted. Also, the balance indices 
and propensity score methods utilized in the simulation study are the same as those in the 
case study. 

The parameter estimates in the propensity score model of the case study serve as true 
parameters to generate true propensity scores for the simulation study. Then we calculate the 
treatment assignment vector T by comparing the true propensity scores e/(x) to a random 
variable 11, generated from the Uni form(0 , 1) distribution, where i refers to the ith person 
(i = 1,2,...,«). We assign 7’ = 1 if (/,■ < <?,(*) and 7} = 0 otherwise. We then obtain 
the outcome variable y by a linear model with the generated treatment assignment and 
covariates as predictors. The true treatment effect is set as 2.29 (the estimate by covariate- 
adjusted regression in the case study) and the coefficients of covariates are set as 0.3 or 
-0.3. 2 

Three approaches were evaluated in the simulation study: (a) the frequentist propensity 
score approach; (b) the two-step Bayesian propensity score approach with noninforma- 
tive prior, reflecting the situation of having little prior information; and (c) the Bayesian 
propensity score approach with informative prior, where the generating propensity score 
parameters are used as prior means with prior precision 100 (i.e., prior variance 0.01) to 
represent the situation of having strong prior information. In practice, accurate prior infor¬ 
mation can be elicited from previous research, established theory, or expert opinion. When 
researchers are in possession of accurate prior information, typically small prior variance 
(high precision) will be specified. When researchers are not certain about accuracy of the 
prior information, a prior with a variance (low precision) will be utilized. As with the 

2 Detailed R code is available upon request. 
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case study, the “bayesglm" function of the arm package (Gelman et al., 2011) is used for 
estimating parameters in the Bayesian logit model. 


RESULTS OF THE SIMULATION STUDY 

Before presenting the results, we note that a common nonidentifiability problem within 
frequentist logistic regression, pointed out by Gelman, Jauklin, Pittau, and Su (2008), 
occurred in the simulation study. Specifically, some covariates perfectly separated the 
treatment group and control group, particularly for the binary predictors. This could occur, 
for example, if the treatment group is composed of all male participants and the control 
group is composed of all female participants. A common strategy for overcoming this 
problem is to remove the predictors that cause the separation until the model is identified. 
This strategy is incorporated as the default in the “glm” function of the R “stats” package 
(R Development Core Team, 2011) for frequentist logistic regression. For this simulation 
study, Bayesian logistic regression with noninformative priors removes the same predictors 
that cause the nonidentifiability problem in frequentist logistic regression. In addition, for 
the weighting methods, we use a mild informative normal prior with prior mean 0 and 
prior precision 0.01 as the “noninformative” prior for the Bayesian approach in order to 
avoid extreme propensity score weights. However, for the two-step Bayesian propensity 
score approach with informative priors, nonidentifiability in the logistic regression was not 
a problem, even with the binary predictors that cause the separation. This indicates the 
important role that prior information can play in improving model identification, especially 
for logit models. 

The average absolute standardized mean/proportion differences and average variance 
ratios across all the covariates as well as average treatment effect estimates across 200 
bootstrapped data sets are presented in Table 2. The average balance of each individual 


Average Covariate Balance in PS Stratification-Simulation Study: Conventional(left), Bayesian Noninformative(middle) & Informative(right) 

(♦: continuous; o: binary) 





SW Mean Difference 


Std Mean Difference 
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Figure 9. Average covariate balance in stratification in the simulation study. 
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Table 2. Average absolute standardized mean/proportion difference (Cohen’s d), average variance 
ratio between treatment and control group across all covariates and treatment effect and standard 
error estimates (Trt. (SE)) in the simulation study 


Method 

Cohen's d 

Variance Ratio 

Trt. (SE) 

Stratification 

Frequentist 

0.02 

Infinite 

2.39 (0.39) 


Bayesian 

(Noninformative) 

0.03 

Infinite 

2.38 (0.53) 


Bayesian 

(Informative) 

0.02 

Infinite 

2.92 (0.47) 

Weighting 

Frequentist 

0.52 

1.41 x 10 10 

2.16(0.40) 


Bayesian 

(Noninformative) 

0.17 

8.58 x 10 3 

2.07 (0.69) 


Bayesian 

(Informative) 

0.13 

1.21 

2.69 (0.49) 

Optimal Matching 

Frequentist 

0.0033 

1.03 

2.24 (0.40) 


Bayesian 

(Noninformative) 

0.0028 

1.03 

2.29 (0.58) 


Bayesian 

(Informative) 

0.0028 

1.03 

2.77 (0.56) 


Average Covariate Balance in PS Optimal Full Matching-Simulation Study: Conventional(left). Bayesian Noninformative(middle) & Informative(right) 

(+: continuous; o: binary) 



Std Mean Difference Std Mean Difference Std Mean Difference 


Figure 10. Average covariate balance in optimal full matching in the simulation study. 


covariate across 200 bootstrapped data sets are illustrated in Figure 9 and Figure 10. 3 
Overall, the frequentist propensity score approach performs similarly to the two-step 

3 The balance performance of two categorical levels are not shown in Figure 8 due to infinite 
variance ratios. The balance properties for the weighting method are not illustrated in the simulation 
study due to huge variance ratios in the frequentist propensity score approach and the two-step 
Bayesian approach with noninformative prior. 
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Bayesian propensity score approach on covariate balance for stratification and optimal 
full matching methods in terms of average Cohen’s d as shown in Table 2. Two categorical 
levels out of 35 continuous covariates/categorical levels yield infinite variance ratios for 
the stratification method for both the frequentist and Bayesian approaches. This is mainly 
because the commonly used stratification method utilizes only five strata, and therefore it 
is hard to balance some highly imbalanced binary covariates. However, the optimal full 
matching method matches the control group with the treatment group much more finely 
and provides the best and most stable variance ratios for both the frequentist and Bayesian 
approaches compared to the stratification and weighting methods (the Cohen’s d ranges 
from 0.0028 to 0.0033, and the variance ratios are all very close to 1), suggesting the overall 
superiority of the optimal full matching method. 

For the weighting method, the two-step Bayesian propensity score approach provides 
better balance than the frequentist approach in terms of standardized mean/proportion dif¬ 
ferences. In regard to the variance ratio, both frequentist and two-step Bayesian propensity 
score approaches with noninformative prior yield very large variance ratios, which are due 
to some extreme propensity score estimates (close to 0 or 1) and thus extreme weights. 
However, by incorporating precise priors, the two-step Bayesian propensity score approach 
offers reasonable balance in regard to the variance ratio, indicating the need to elicit ac¬ 
curate prior information especially when data are extreme. Overall, the weighting method 
provides worse balance and treatment effect estimates, indicating that the performance of 
weighting can easily be affected by extreme propensity score weights. 

In terms of the effects of prior information, on average, the two-step Bayesian propen¬ 
sity score approach with informative priors performs similarly on covariate balance com¬ 
pared to the same Bayesian approach but with noninformative prior. This finding indicates 
that although the two-step Bayesian propensity score approach with accurate prior infor¬ 
mation regarding treatment effect can provide better treatment effect estimates as shown 
in Kaplan and Chen (2012), the prior information with regard to propensity score model 
parameters have less impact on the covariate balance. 

For the treatment effect estimates, frequentist approach and Bayesian approach with 
noninformative priors perform similarly as expected. Among all the methods, the Bayesian 
two-step method using optimal full matching with noninformative priors on the propen¬ 
sity score model offers the least biased treatment effect estimate (2.29, bias = 0). These 
results agree with the findings in Kaplan and Chen (2012) that the optimal full matching 
method provides more accurate treatment effect estimates for both frequentist and Bayesian 
propensity score approaches. 

With regard to the consistency of performance on covariate balance and treatment effect 
estimation, Table 2 shows that there is a link between covariate balance and treatment effect 
estimates for the frequentist and Bayesian noninformative approaches under the optimal 
full matching method, where the Bayesian propensity score approach with noninformative 
priors offers better covariate balance and less biased treatment effect estimate compared to 
the frequentist approach. Also, the optimal full matching method outperforms the stratifica¬ 
tion and weighting method in terms of both covariate balance and treatment effect estimates 
for frequentist and Bayesian noninformative approaches. Despite achieving good covariate 
balance, the Bayesian propensity score approach with informative priors provides the most 
biased treatment effect estimates across different methods compared to the frequentist and 
Bayesian noninformative approaches, which may be because the same highly precise prior 
obtained from the case study was used for all the 200 bootstrapped data sets so that the 
prior information may not be reliable for some data sets. In practice, with only one data set, 
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precise and accurate priors for the propensity score model parameters may have a different 
impact on the treatment effect estimation. 


CONCLUSION 

The two-step Bayesian approach for propensity score analysis proposed by Kaplan and Chen 
(2012) maintains a fully Bayesian specification and naturally accounts for uncertainty in the 
propensity scores by specifying prior distributions on the propensity score model parameters 
while avoiding the concerns surrounding the joint modeling of Bayesian propensity score 
model and outcome model. This article examined the balancing properties of the two-step 
Bayesian propensity score approach via a case study and a real-data based simulation 
study for propensity score stratification, weighting and optimal full matching methods, 
and demonstrated a way of evaluating covariate balance in Bayesian propensity score 
analysis via point estimates and posterior distribution of balance indices. In addition, the 
link between covariate balance and treatment effect estimation was evaluated. 

To summarize, results of the case study revealed that both Bayesian and frequentist 
propensity score approaches substantially reduced initial imbalance and their performance 
on covariate balance was similar in regard to the standardized mean/proportion differences 
and variance ratios in the treatment group and control group. Similar performance was also 
found with respect to the corresponding 95% bootstrap intervals and posterior probability 
intervals. Specifically, the frequentist propensity score approach provided slightly better 
covariate balance for the propensity score stratification and weighting methods, where as 
the two-step Bayesian approach offered slightly better covariate balance for the optimal 
full matching method. The link between covariate balance and treatment effect estimation 
varied across different propensity score methods. 

Results of the simulation study showed that both Bayesian and frequentist propensity 
score approaches achieved good covariate balance for stratification and optimal full match¬ 
ing methods. The Bayesian propensity score approach with informative priors showed 
similar balance performance compared to the Bayesian approach with noninformative pri¬ 
ors, indicating that the specification of the prior distribution does not greatly influence the 
balance properties of the two-step Bayesian approach. The optimal full matching method, 
on average, offered the best covariate balance and the least biased treatment effect estimates 
compared to stratification and weighting methods for frequentist and Bayesian noninfor¬ 
mative propensity score approaches. Overall, the two-step Bayesian optimal full matching 
approach with noninformative priors performed the best in terms of both covariate balance 
and treatment effect estimate. 

One benefit of conducting Bayesian propensity score analysis is that we can obtain a 
distribution of estimated propensity scores and thus a distribution of corresponding balance 
indices (Cohen’s d and variance ratio in this article) so that the variation in balance indices 
can be captured in addition to the point estimates to assist in balance checking. Good 
balance is achieved if both the point estimates and the posterior probability intervals of the 
balance indices fall into the desirable range. We note that in the simulation study, some 
extreme bootstrap data sets led to extreme propensity score estimates and thus distorted 
variance ratios of covariates between treatment group and control group. However, after 
incorporating precise prior information, the extreme data sets were pulled toward the prior 
distribution and, as a result, reasonable propensity score estimates and balance indices were 
obtained. This is another benefit of the Bayesian propensity score approach. 
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One limitation we noticed in the simulation study is that the Metropolis acceptance 
rate in MCMC posterior sampling can be very small for the Bayesian logit model when 
the prior is improper and/or data are extreme. Utilizing a proper prior may improve the 
acceptance rate of the Metropolis sampling algorithm, which is a topic for future research. 
Using other computing algorithm such as the EM algorithm used in this article provides a 
possible solution to this problem. 

In this article, we investigated the mean and variance balance for covariates in the linear 
response surface. When there are nonlinear relationships between covariates and outcome, 
covariate balance in higher order moments such as variance, squared terms and interaction 
terms should be given extra attention because higher order imbalance could lead to bias 
even if covariate mean balance has been achieved (Hill, 2008). The balance performance for 
the two-step Bayesian propensity score approach with nonlinear confounders is warranted 
for further study. 

We also want to make a note here that misspecification in the propensity score model can 
lead to poor covariate balance and biased treatment effect estimates (Kang & Schafer, 2007), 
where as misspecification in the outcome model can bias the treatment effect estimates 
even more (Austin & Mamdani, 1993). In addition to carefully choosing covariates and 
specifying the functional form of the propensity score model and outcome model, Bayesian 
nonparametric modeling procedure, such as Bayesian Additive Regression Trees, can be an 
alternative, which focuses on the flexible modeling of the outcome and is shown to produce 
more efficient estimates in the nonlinear settings (Hill, 2011). 

To conclude, this article evaluated the balancing properties of the two-step Bayesian 
propensity score approach and linked it with the treatment effect estimation. Consistent 
with the findings of Kaplan and Chen (2012), we find that the two-step Bayesian propen¬ 
sity approach under optimal full matching is recommended in terms of both establishing 
covariate balance and reducing bias in the estimated treatment effect. 
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